Taking Uncertainty Seriously: Bayesian Marginal Structural Models for Causal Inference

A. Jordan Nafa and Andrew Heiss

University of North Texas and Georgia State University

September 16th, 2022

Introduction

Causal Inference in Political Science

  • Cross-sectional time series data and causal inference

    • Causal inference literature in political science focuses largely on frequentist econometrics (e.g., Imai and Kim 2019, 2020)

      • Still relies on strict exogeneity assumption
    • Recent works drawing on approaches developed in biostatistics outline a framework for estimating causal effects under the relatively weaker assumption of sequential ignorability (Acharya, Blackwell, and Sen 2016; Blackwell and Glynn 2018)

  • Our goal in this paper is to extend the approach to causal inference under selection on observables introduced by Blackwell and Glynn (2018) to a Bayesian framework

Marginal Structural Models

  • Marginal structural models (MSMs) are a multi-stage approach to estimating causal effects where baseline random assignment is not possible (Robins 1997; Robins, Hernán, and Brumback 2000)

    • Relies on inverse probability of treatment weighting to achieve covariate balance by constructing pseudo-populations (Cole and Hernán 2008; Imai and Ratkovic 2015; Saarela et al. 2015)

    • Adjusting for biasing paths in the propensity model allows for identification of causal effects that are impossible to estimate in a single model due to post-treatment bias

    • Possible to estimate lagged effects and “treatment histories” in cross-sectional time series data under complex temporal dependence (Blackwell and Glynn 2018)

Why Bayesian Estimation?

  • Frequentist uncertainty estimates are based on assumptions about sampling distributions

    • Yet, in many areas of political science our data comprise an apparent population rather than a random sample from a larger population of interest (Berk, Western, and Weiss 1995; Gill 2001)

      • For example, all sovereign countries between 1945 and 2020 or all states in the U.S. over some time period
    • It doesn’t make sense to think in terms of random samples from a population if your observed data is the population (Gill and Heuberger 2020; Western and Jackman 1994)

Why Bayesian Estimation?

  • A Bayesian framework provides a straightforward approach to accounting for and propagating uncertainty in the specification of the propensity model

    • Bayesian Model Averaging (BMA) and cross-validation-based stacking approaches allow us to avoid choosing a single specification for the propensity model (Kaplan and Chen 2014; Zigler and Dominici 2014)

    • Acknowledges that we are virtually always uncertain about what the true set of confounders is

    • May help reduce the degree to which our results depend on the propensity model being correctly specified (Hahn, Murray, and Carvalho 2020)

A Bayesian Pseudo-Likelihood Approach

\[ \definecolor{treat}{RGB}{27,208,213} \definecolor{outcome}{RGB}{98,252,107} \definecolor{baseconf}{RGB}{244,199,58} \definecolor{covariates}{RGB}{178,26,1} \definecolor{index}{RGB}{37,236,167} \definecolor{timeid}{RGB}{244,101,22} \definecolor{mu}{RGB}{71,119,239} \definecolor{sigma}{RGB}{219,58,7} \newcommand{normalcolor}{\color{white}} \newcommand{treat}[1]{\color{treat} #1 \normalcolor} \newcommand{resp}[1]{\color{outcome} #1 \normalcolor} \newcommand{conf}[1]{\color{baseconf} #1 \normalcolor} \newcommand{covar}[1]{\color{covariates} #1 \normalcolor} \newcommand{obs}[1]{\color{index} #1 \normalcolor} \newcommand{tim}[1]{\color{timeid} #1 \normalcolor} \newcommand{mean}[1]{\color{mu} #1 \normalcolor} \newcommand{vari}[1]{\color{sigma} #1 \normalcolor} \]

Bayesian Design Stage Estimation

For some binary treatment \(\treat{X}_{\obs{i}\tim{t}}\), the posterior expectation of the stabilized inverse probability of treatment weights for each unit \(\obs{i}\) at time \(\tim{t}\) is

\[ \text{IPW}_{\obs{i}\tim{t}} = \prod^{\tim{t}}_{\tim{t} = \tim{1}} \frac{\int\Pr[\treat{X}_{\obs{i}\tim{t}}~ | ~\treat{X}_{\obs{i}\tim{t-1}},~ \conf{C}_{\obs{i}}]\pi(\theta)d\theta}{\int\Pr[\treat{X}_{\obs{i}\tim{t}}~ |~\covar{Z}_{\obs{i}\tim{t}}, ~ \treat{X}_{\obs{i}\tim{t-1}},~ \resp{Y}_{\obs{i}\tim{t-1}},~ \conf{C}_{\obs{i}}]\pi(\theta)d\theta} \]

  • \(\treat{X}_{\obs{i}\tim{t-1}}\) and \(\resp{Y}_{\obs{i}\tim{t-1}}\) denote the treatment status and outcome for the \(\obs{i^{th}}\) unit in the previous period respectively

  • \(\conf{C}_{\obs{i}}\) is a set of time-invariant baseline covariates

  • \(\covar{Z}_{\obs{i}\tim{t}}\) is a set of time-varying covariates that satisfies sequential ignorability

  • Although we focus mainly on the average treatment effect at times \(\tim{t}\) and \(\tim{t-1}\), it is possible to estimate longer lags and other estimands as well.
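
The product in the stabilized weight can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the numerator and denominator probabilities have already been averaged over posterior draws, and every number below is made up for the example. The helper names `prob_of_observed` and `stabilized_weights` are hypothetical.

```python
from itertools import accumulate
from operator import mul


def prob_of_observed(p_treated, x):
    """Probability of the treatment status actually observed (x is 0 or 1)."""
    return p_treated if x == 1 else 1.0 - p_treated


def stabilized_weights(numerator_probs, denominator_probs):
    """Cumulative stabilized IPT weights for one unit over periods 1..T.

    Each input holds, per period, the probability of the observed treatment
    under the numerator or the denominator model.
    """
    if len(numerator_probs) != len(denominator_probs):
        raise ValueError("need one numerator and one denominator value per period")
    ratios = [n / d for n, d in zip(numerator_probs, denominator_probs)]
    # The weight at period t is the running product of the ratios up to t.
    return list(accumulate(ratios, mul))


# Hypothetical treatment history and fitted probabilities of treatment.
x = [1, 0, 1]
p_num = [0.5, 0.4, 0.6]  # Pr(X_t = 1 | X_{t-1}, C), assumed values
p_den = [0.8, 0.5, 0.3]  # Pr(X_t = 1 | Z_t, X_{t-1}, Y_{t-1}, C), assumed values

num = [prob_of_observed(p, xi) for p, xi in zip(p_num, x)]
den = [prob_of_observed(p, xi) for p, xi in zip(p_den, x)]
weights = stabilized_weights(num, den)
```

Units whose observed treatment was unlikely given their covariate history (small denominator) receive larger weights, which is what rebalances the pseudo-population.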

Bayesian Design Stage Estimation

It is also possible to extend IPTW to cases in which \(\treat{X}_{\obs{i}\tim{t}}\) is continuous, in which case the stabilized weights are

\[\text{IPW}_{\obs{i}\tim{t}} = \prod^{\tim{t}}_{\tim{t} = \tim{1}} \frac{f_{\treat{X}_{\obs{i}\tim{t}} | \treat{X}_{\obs{i}\tim{t-1}},\conf{C}_{\obs{i}}}[(\treat{X}_{\obs{i}\tim{t}}~ | ~\treat{X}_{\obs{i}\tim{t-1}},~ \conf{C}_{\obs{i}}); ~\mean{\mu}, ~\vari{\sigma^{2}}]}{f_{\treat{X}_{\obs{i}\tim{t}} |\covar{Z}_{\obs{i}\tim{t}}, \treat{X}_{\obs{i}\tim{t-1}}, \resp{Y}_{\obs{i}\tim{t-1}}, \conf{C}_{\obs{i}}}[(\treat{X}_{\obs{i}\tim{t}}~ |~\covar{Z}_{\obs{i}\tim{t}}, ~ \treat{X}_{\obs{i}\tim{t-1}},~ \resp{Y}_{\obs{i}\tim{t-1}},~ \conf{C}_{\obs{i}}); ~\mean{\mu}, ~\vari{\sigma^{2}}]} \]

  • The variables \(\treat{X}\), \(\resp{Y}\), \(\covar{Z}\), and \(\conf{C}\) in the numerator and denominator are defined as in the binary version

  • The \(f_{\dots}(\cdot)\) expressions represent a probability density function with mean \(\mean{\mu}\) and variance \(\vari{\sigma^{2}}\)

  • We’ll focus mainly on binary treatment regimes, though this particular method tends to behave better for a continuous \(\treat{X}\) in some cases
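
For a continuous treatment, the same running product applies with density ratios in place of probability ratios. The sketch below assumes normal densities and takes the conditional means and standard deviations as given inputs; in practice they would come from the fitted numerator and denominator models. The function names and all values are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def continuous_stabilized_weights(x, mu_num, sigma_num, mu_den, sigma_den):
    """Cumulative stabilized weights for one unit with a continuous treatment.

    x, mu_num and mu_den are per-period sequences; the sigmas are scalars.
    """
    weights, running = [], 1.0
    for x_t, m_num, m_den in zip(x, mu_num, mu_den):
        running *= normal_pdf(x_t, m_num, sigma_num) / normal_pdf(x_t, m_den, sigma_den)
        weights.append(running)
    return weights


# Hypothetical observed doses and fitted conditional means for three periods.
doses = [0.2, 1.1, -0.4]
example_weights = continuous_stabilized_weights(
    doses,
    mu_num=[0.0, 0.1, 0.5], sigma_num=1.2,
    mu_den=[0.3, 0.9, -0.2], sigma_den=0.8,
)
```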

The Bayesian Pseudo-Likelihood

To propagate uncertainty in the distribution of weights from the design stage while avoiding the problem of feedback inherent in joint estimation, we develop a Bayesian Pseudo-Likelihood estimator (Savitsky and Toth 2016; Williams and Savitsky 2020a, 2020b)

\[\begin{align} \hat{\pi}( \theta~|~y, \tilde{w}) ~\propto~ \left [\prod_{i = 1}^{n} \Pr(y_{i} ~|~ \theta)^{\tilde{w_{i}}}\right ]\pi(\theta) \end{align}\]
  • \(\tilde{w_{i}}\) is the realized IPT weight for the \(i^{th}\) observation

  • \(\prod_{i = 1}^{n} \Pr(y_{i} ~|~ \theta)^{\tilde{w_{i}}}\) is the pseudo-likelihood and \(\pi(\theta)\) denotes the prior for the parameter \(\theta\)

  • \(\hat{\pi}( \theta~|~y, \tilde{w})\) represents the Bayesian pseudo-posterior for \(\theta\)
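
On the log scale, raising each likelihood contribution to the power of its weight becomes a weighted sum of log-likelihood terms plus the log prior. The sketch below illustrates this for a normal outcome with a single unknown mean; the known `sigma`, the `prior_sd`, and the data are assumptions chosen for the example rather than anything taken from the paper.

```python
import math


def normal_logpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma^2) at y."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2


def log_pseudo_posterior(mu, y, w, sigma=1.0, prior_sd=10.0):
    """Unnormalized log pseudo-posterior for the mean of a normal outcome.

    Each observation's log-likelihood is multiplied by its weight, which is
    the log-scale version of exponentiating the likelihood by the weight.
    """
    log_prior = normal_logpdf(mu, 0.0, prior_sd)
    log_pseudo_lik = sum(w_i * normal_logpdf(y_i, mu, sigma) for y_i, w_i in zip(y, w))
    return log_pseudo_lik + log_prior


# Hypothetical outcomes and realized weights.
y = [1.0, 2.0, 6.0]
w = [0.5, 1.0, 1.5]
weighted_mean = sum(w_i * y_i for w_i, y_i in zip(w, y)) / sum(w)
```

With a diffuse prior, the pseudo-posterior for the mean concentrates near the weighted mean of the outcomes rather than the unweighted mean, so heavily weighted observations pull the estimate toward themselves.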

Parameterization of the Weights

  • We decompose the matrix of weights from the design stage into a location component \(\lambda\) and a scale component \(\delta\)

  • The weight for each observation is sampled as \[\tilde{w}_{\obs{i}\tim{t}} \sim \lambda_{\obs{i}\tim{t}} + \delta_{\obs{i}\tim{t}} \cdot \pi(\delta_{\obs{i}\tim{t}})\] where \(\pi(\delta_{\obs{i}\tim{t}})\) is a regularizing prior on the scale of the weights, such as an exponential distribution with a rate greater than 3.5 or a Beta distribution with shape parameters \(\alpha = 2\) and \(\beta \ge 2\)

  • Provides computational stability and shuts down extreme values when the IPT weights have high variance

  • Straightforward extensions for nested data structures via double-weighted estimation (Savitsky and Williams 2021)
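
One possible reading of the location-scale parameterization is sketched below. It assumes that the location is a point summary of an observation's design-stage weight (for example its posterior mean), that the scale is a measure of its spread (for example its posterior standard deviation), and that the regularizing term is a Beta(2, 5) draw. These are assumptions made for illustration; the slides do not spell out the sampling scheme at this level of detail, and `sample_weight` is a hypothetical helper.

```python
import random


def sample_weight(location, scale, rng, shape_a=2.0, shape_b=5.0):
    """Draw one realized weight as location + scale * Beta(shape_a, shape_b).

    Because a Beta draw lies in [0, 1], the realized weight is bounded
    between location and location + scale.
    """
    return location + scale * rng.betavariate(shape_a, shape_b)


rng = random.Random(2022)  # fixed seed so the illustration is reproducible
# Hypothetical location and scale for a single observation.
draws = [sample_weight(1.0, 0.4, rng) for _ in range(1000)]
mean_draw = sum(draws) / len(draws)
```

Under this reading the Beta prior shrinks the contribution of the scale term, so an observation with a highly variable design-stage weight cannot be pushed arbitrarily far from its location.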

Simulation Study Design

Simulation Study Overview

  • To assess parameter recovery and bias, we adapt the original simulation design from Blackwell and Glynn (2018)

  • We simulate 2000 data sets of varying dimensions, manipulating the path \(\treat{X}_{\obs{i}\tim{t-1}} \longrightarrow \covar{Z}_{\obs{i}\tim{t}}\)

    • Periods \(\in \{20, 50\}\)

    • Groups \(\in \{25, 45, 65, 85, 100\}\)

  • Objectives

    • Identify both \(\treat{X}_{\obs{i}\tim{t}} \longrightarrow \resp{Y}_{\obs{i}\tim{t}}\) and \(\treat{X}_{\obs{i}\tim{t-1}} \longrightarrow \resp{Y}_{\obs{i}\tim{t}}\)

    • Compare our Bayesian Pseudo-Likelihood approach against the more common auto-regressive distributed lag (ARDL) specification
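
The dimensions listed above imply a small grid of panel sizes, which can be enumerated as follows. This sketch only crosses the stated numbers of periods and groups; how the simulated data sets are allocated across these cells is not stated here and is not assumed.

```python
from itertools import product

PERIODS = (20, 50)
GROUPS = (25, 45, 65, 85, 100)

# One entry per combination of periods and groups, with the implied
# number of unit-period rows in a balanced panel of that size.
design_cells = [
    {"periods": t, "groups": g, "rows": t * g}
    for t, g in product(PERIODS, GROUPS)
]
```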

DAG for the Simulated Data

ARDL Model Specification

\[\begin{align} \resp{y}_{\obs{i}\tim{t}} &\sim \textit{Normal}(\mu_{\obs{i}\tim{t}}, \epsilon^{2})\\ &\mu_{\obs{i}\tim{t}} = \alpha + \beta_{1}\treat{X}_{\obs{i}\tim{t}} + \beta_{2}\treat{X}_{\obs{i}\tim{t-1}} + \beta_{3}\resp{Y}_{\obs{i}\tim{t-1}} + \beta_{4}\resp{Y}_{\obs{i}\tim{t-2}} +\\ & \quad \beta_{5}\covar{Z}_{\obs{i}\tim{t}} + \beta_{6}\covar{Z}_{\obs{i}\tim{t-1}}\\ \text{with priors}\\ \alpha &\sim \textit{Normal}(\bar{y}, ~ 2 \cdot \sigma_{y}) \quad\quad\quad \beta_{k} \sim \textit{Normal}\left(0, ~ 1.5 \cdot \frac{\sigma_{y}}{\sigma_{x}}\right)\\ \epsilon &\sim \textit{Exponential}\left(\frac{1}{\sigma_{y}}\right) &\\ \end{align}\]
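
The ARDL specification needs the current and lagged treatment, two lags of the outcome, and the current and lagged covariate for every row. A minimal sketch of building those predictors for one unit is given below; the function name `ardl_rows` and the dictionary keys are hypothetical, and the first two periods are dropped because their lags are undefined.

```python
def ardl_rows(y, x, z):
    """Build ARDL predictor rows for a single unit's time series.

    y, x and z are equal-length sequences ordered by time. Rows start at the
    third period so that both outcome lags exist.
    """
    if not (len(y) == len(x) == len(z)):
        raise ValueError("y, x and z must have the same length")
    rows = []
    for t in range(2, len(y)):
        rows.append({
            "y": y[t],
            "x": x[t],
            "x_lag1": x[t - 1],
            "y_lag1": y[t - 1],
            "y_lag2": y[t - 2],
            "z": z[t],
            "z_lag1": z[t - 1],
        })
    return rows


# Hypothetical four-period series for one unit.
rows = ardl_rows(y=[1.0, 2.0, 3.0, 4.0], x=[0, 1, 1, 0], z=[0.1, 0.2, 0.3, 0.4])
```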

MSM Design Stage Specification

As illustrated in the equation for the stabilized weights, we specify two separate models for the numerator and denominator with weakly informative independent normal priors on \(\alpha\) and \(\beta\)

\[\begin{align} \treat{X}_{\obs{i}\tim{t}} &\sim \textit{Bernoulli}(\theta_{\obs{i}\tim{t}})\\ &\theta_{\obs{i}\tim{t}} = \text{logit}^{-1}(\alpha + X_{n}\beta_{k})\\ \text{with priors}\\ \alpha &\sim \textit{Normal}(0, ~2) \quad \quad \beta_{k} \sim \textit{Normal}(0,~ 1)\\ \end{align}\]
  • For the numerator model, the matrix \(X_{n}\) is simply \(\treat{X}_{\obs{i}\tim{t-1}}\)

  • For the denominator model, \(X_{n} = \{\covar{Z}_{\obs{i}\tim{t}}, ~ \treat{X}_{\obs{i}\tim{t-1}},~ \resp{Y}_{\obs{i}\tim{t-1}}\}\)
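
The two design-stage models differ only in which predictors enter the linear predictor. The sketch below evaluates both for a single observation using made-up coefficient values; in the actual procedure these would be posterior draws from the fitted numerator and denominator models. The helper names are hypothetical.

```python
import math


def inv_logit(eta):
    """Numerically stable inverse logit."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def treatment_probability(alpha, betas, covariates):
    """Pr(X = 1) from a logistic model with intercept alpha."""
    if len(betas) != len(covariates):
        raise ValueError("need one coefficient per covariate")
    return inv_logit(alpha + sum(b * c for b, c in zip(betas, covariates)))


# Hypothetical values for one unit-period.
x_lag, y_lag, z_now, x_now = 1, 0.3, -0.5, 1

# Numerator model: lagged treatment only.
p_num = treatment_probability(-0.2, [0.8], [x_lag])
# Denominator model: time-varying covariate, lagged treatment, lagged outcome.
p_den = treatment_probability(-0.2, [0.5, 0.8, 0.3], [z_now, x_lag, y_lag])

# Ratio of the probabilities of the treatment status actually observed.
ratio = (p_num if x_now == 1 else 1.0 - p_num) / (p_den if x_now == 1 else 1.0 - p_den)
```

This ratio is the single-period contribution to the stabilized weight; the weight at a given period is the running product of these ratios over the unit's history.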

MSM Outcome Model Specification

\[\begin{align} \resp{y}_{\obs{i}\tim{t}} &\sim \textit{Normal}(\mu_{\obs{i}\tim{t}}, \epsilon^{2})^{\tilde{w}_{\obs{i}\tim{t}}}\\ &\mu_{\obs{i}\tim{t}} = \alpha + \beta_{1}\treat{X}_{\obs{i}\tim{t}} + \beta_{2}\treat{X}_{\obs{i}\tim{t-1}} & \\ \text{where}\\ \tilde{w}_{\obs{i}\tim{t}} &\sim \lambda_{\obs{i}\tim{t}} + \delta_{\obs{i}\tim{t}} \cdot \pi{(\delta)}\\ \text{with priors}\\ \alpha &\sim \textit{Normal}(\bar{y}, ~ 2 \cdot \sigma_{y}) \quad \quad \beta_{k} \sim \textit{Normal}\left(0, ~ 1.5 \cdot \frac{\sigma_{y}}{\sigma_{x}}\right)\\ \epsilon &\sim \textit{Exponential}\left(\frac{1}{\sigma_{y}}\right) \quad \quad \delta_{\obs{i}\tim{t}} \sim \textit{Beta}(2, ~ 5)\\ \end{align}\]

Simulation Results

Simulation Results

Conclusions

Conclusion

  • Overall, our proposed procedure performs well in terms of parameter recovery under fairly general conditions

  • Going forward, we need to apply this procedure to real-world political science examples

  • We plan to release an R package that implements our procedure using the brms package as a backend

    • This should make the method accessible to anyone familiar with standard R model syntax

References

Acharya, Avidit, Matthew Blackwell, and Maya Sen. 2016. “Explaining Causal Findings Without Bias: Detecting and Assessing Direct Effects.” American Political Science Review 110(3): 512–29.
Berk, Richard A., Bruce Western, and Robert E. Weiss. 1995. “Statistical Inference for Apparent Populations.” Sociological Methodology 25: 421.
Blackwell, Matthew, and Adam N. Glynn. 2018. “How to Make Causal Inferences with Time-Series Cross-Sectional Data Under Selection on Observables.” American Political Science Review 112: 1067–82.
Cole, Stephen R., and Miguel A. Hernán. 2008. “Constructing Inverse Probability Weights for Marginal Structural Models.” American Journal of Epidemiology 168(6): 656–64.
Forastiere, Laura, Alessandra Mattei, and Peng Ding. 2018. “Principal Ignorability in Mediation Analysis: Through and Beyond Sequential Ignorability.” Biometrika 105(4): 979–86.
Gill, Jeff. 2001. “Whose Variance Is It Anyway? Interpreting Empirical Models with State-Level Data.” State Politics & Policy Quarterly 1(3): 318–38.
Gill, Jeff, and Simon Heuberger. 2020. “Bayesian Modeling and Inference: A Post-Modern Perspective.” In The SAGE Handbook of Research Methods in Political Science and International Relations, eds. Luigi Curini and Robert Franzese. London, UK: SAGE, 961–84.
Hahn, P. Richard, Jared S. Murray, and Carlos M. Carvalho. 2020. “Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects (with Discussion).” Bayesian Analysis 15(3).
Imai, Kosuke, and In Song Kim. 2019. “When Should We Use Unit Fixed Effects Regression Models for Causal Inference with Longitudinal Data?” American Journal of Political Science 63(2): 467–90.
———. 2020. “On the Use of Two-Way Fixed Effects Regression Models for Causal Inference with Panel Data.” Political Analysis 29(3): 405–15.
Imai, Kosuke, and Marc Ratkovic. 2015. “Robust Estimation of Inverse Probability Weights for Marginal Structural Models.” Journal of the American Statistical Association 110(511): 1013–23.
Kaplan, David, and Jianshen Chen. 2014. “Bayesian Model Averaging for Propensity Score Analysis.” Multivariate Behavioral Research 49(6): 505–17.
Kurtz, Marcus J., and Adam Lauretig. 2021. “Does Free-Market Reform Induce Protest? Selection, Post-Treatment Bias, and Depoliticization.” British Journal of Political Science 52(2): 968–76.
Ladam, Christina, Jeffrey J. Harden, and Jason H. Windett. 2018. “Prominent Role Models: High‐profile Female Politicians and the Emergence of Women as Candidates for Public Office.” American Journal of Political Science 62: 369–81.
Liao, Shirley X., and Corwin M. Zigler. 2020. “Uncertainty in the Design Stage of Two-Stage Bayesian Propensity Score Analysis.” Statistics in Medicine 39(17): 2265–90.
Liu, Licheng, Ye Wang, and Yiqing Xu. 2022. “A Practical Guide to Counterfactual Estimators for Causal Inference with Time-Series Cross-Sectional Data.” American Journal of Political Science.
Mogstad, Magne, and Alexander Torgovitsky. 2018. “Identification and Extrapolation of Causal Effects with Instrumental Variables.” Annual Review of Economics 10(1): 577–613.
Robins, James M. 1997. “Causal Inference from Complex Longitudinal Data.” In Latent Variable Modeling and Applications to Causality, Lecture notes in statistics, ed. Maia Berkane. New York: Springer-Verlag, 69–117.
Robins, James M., Miguel Ángel Hernán, and Babette Brumback. 2000. “Marginal Structural Models and Causal Inference in Epidemiology.” Epidemiology 11(5): 550–60.
Robins, James M., Miguel A. Hernán, and Larry Wasserman. 2015. “Discussion of ‘On Bayesian Estimation of Marginal Structural Models’.” Biometrics 71(2): 296–99.
Saarela, Olli, David A. Stephens, Erica E. M. Moodie, and Marina B. Klein. 2015. “On Bayesian Estimation of Marginal Structural Models.” Biometrics 71(2): 279–88.
Savitsky, Terrance D., and Daniell Toth. 2016. “Bayesian Estimation Under Informative Sampling.” Electronic Journal of Statistics 10(1).
Savitsky, Terrance D., and Matthew R. Williams. 2021. “Pseudo Bayesian Mixed Models Under Informative Sampling.” arXiv.
Swamy, P. A. V. B., George Tavlas, and Stephen Hall. 2015. “On the Interpretation of Instrumental Variables in the Presence of Specification Errors.” Econometrics 3(1): 55–64.
Western, Bruce, and Simon Jackman. 1994. “Bayesian Inference for Comparative Research.” American Political Science Review 88: 412–23.
Williams, Matthew R., and Terrance D. Savitsky. 2020a. “Bayesian Estimation Under Informative Sampling with Unattenuated Dependence.” Bayesian Analysis 15(1).
———. 2020b. “Uncertainty Estimation for Pseudo-Bayesian Inference Under Complex Sampling.” International Statistical Review 89(1): 72–107.
Zigler, Corwin M. et al. 2013. “Model Feedback in Bayesian Propensity Score Estimation.” Biometrics 69(1): 263–73.
Zigler, Corwin Matthew, and Francesca Dominici. 2014. “Uncertainty in Propensity Score Estimation: Bayesian Methods for Variable Selection and Model-Averaged Causal Effects.” Journal of the American Statistical Association 109(505): 95–107.